1 Introduction

Estimation of distribution algorithms (EDAs) replace standard variation
operators of genetic and evolutionary algorithms by building and sampling probabilistic models of
promising candidate solutions. Some of the earliest EDAs already eliminated the need to maintain
the explicit population of candidate solutions used in most standard evolutionary algorithms;
instead, they updated the probabilistic model incrementally using only a few candidate solutions
at a time [2] [13]. The main advantage of such incremental EDAs is that memory complexity is
greatly reduced. One of the most successful results of this line of research was the application
of the compact genetic algorithm (cGA) [13] to a noisy problem with over one billion bits [11] [32].
Nonetheless, all incremental EDAs proposed in the past use either univariate models with no
interactions between the variables or probabilistic models in the form of a tree.

This paper proposes an incremental version of the Bayesian optimization algorithm (BOA),
which uses Bayesian networks to model promising solutions and sample new ones. The proposed
algorithm is called the incremental BOA (iBOA). While many of the ideas can be adopted
from the work on other incremental EDAs, the design of iBOA poses one unique challenge: how
to incrementally update a multivariate probabilistic model without either committing to a highly
restricted set of structures at the beginning of the run or having to maintain all possible
multivariate statistics that can be useful throughout the run? We propose one solution to this
challenge and outline another possible approach to tackling this problem. We then test iBOA on
several boundedly difficult decomposable problems to verify its robustness and scalability.
Finally, we outline interesting topics for future work in this area.

The paper starts by discussing related work in Section 2. Section 3 outlines the standard
population-based Bayesian optimization algorithm (BOA). Section 4 describes the incremental BOA
(iBOA). Section 5 presents and discusses experimental results. Section 6 outlines some of the most
important challenges for future research in this area. Finally, Section 7 summarizes and concludes
the paper.

2 Background 

This section reviews some of the incremental estimation of distribution algorithms. Throughout 
the section, we assume that candidate solutions are represented by fixed-length binary strings, 
although most of the methods can be defined for fixed-length strings over any finite alphabet in a 
straightforward manner. 

2.1 Population-Based Incremental Learning (PBIL) 

The population-based incremental learning (PBIL) algorithm [2] was one of the first estimation of
distribution algorithms and was mainly inspired by the equilibrium genetic algorithm (EGA) [17].
PBIL maintains a probabilistic model of promising solutions in the form of a probability vector.
The probability vector considers only univariate probabilities and for each string position it stores
the probability of a 1 in that position. For an n-bit string, the probability vector is thus a vector
of n probabilities $p = (p_1, p_2, \ldots, p_n)$ where $p_i$ encodes the probability of a 1 in the
i-th position. Initially, all entries in the probability vector are set to 0.5, encoding the uniform
distribution over all binary strings of length n.

In each iteration of PBIL, a fixed number $N$ of binary strings is first generated from the
current probability vector; for each new string and each string position $i$, the bit in the i-th
position is set to 1 with probability $p_i$ from the current probability vector (otherwise the bit
is set to 0). The generated solutions are evaluated, and the $N_{best}$ best solutions are then
selected from the new solutions based on the results of the evaluation, where $N_{best} < N$. The
selected best solutions are then used to update the probability vector. Specifically, for each
selected solution $(x_1, x_2, \ldots, x_n)$, the probability vector $p = (p_1, p_2, \ldots, p_n)$
is updated as follows:

\[
p_i \leftarrow p_i (1 - \lambda) + x_i \lambda \quad \text{for all } i \in \{1, \ldots, n\},
\]

where $\lambda \in (0, 1)$ is the learning rate, typically set to some small value. If $x_i = 1$,
then $p_i$ is increased; otherwise, $p_i$ is decreased. The rate of increase or decrease depends on
the learning rate $\lambda$ and the current value of the corresponding probability-vector entry. In
the original work on PBIL [2], $N = 200$, $N_{best} = 2$, and $\lambda = 0.005$.

Although PBIL does not maintain an explicit population of candidate solutions, the learning
rate $\lambda$ can be used in a similar manner as the population-size parameter of standard
population-based genetic and evolutionary algorithms. To simulate the effects of larger
populations, $\lambda$ should be decreased; to simulate the effects of smaller populations,
$\lambda$ should be increased.
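
The PBIL iteration described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function and parameter names (`pbil`, `num_samples`, `num_best`, `learning_rate`) and the onemax test function are hypothetical and do not come from the original PBIL work; the default values mirror the settings quoted above.

```python
import random

def pbil(fitness, n, num_samples=200, num_best=2, learning_rate=0.005,
         iterations=1000, seed=0):
    """Minimal PBIL sketch: evolve a probability vector instead of a population."""
    rng = random.Random(seed)
    p = [0.5] * n  # uniform distribution over all n-bit strings
    for _ in range(iterations):
        # Sample candidate strings: bit i is 1 with probability p[i].
        samples = [[1 if rng.random() < p[i] else 0 for i in range(n)]
                   for _ in range(num_samples)]
        # Keep the num_best highest-fitness samples.
        samples.sort(key=fitness, reverse=True)
        for x in samples[:num_best]:
            # Update rule: p_i <- p_i (1 - lambda) + x_i lambda
            for i in range(n):
                p[i] = p[i] * (1.0 - learning_rate) + x[i] * learning_rate
    return p

def onemax(x):
    """Illustrative fitness function (an assumption): number of 1s in the string."""
    return sum(x)
```

For example, `pbil(onemax, n=10, iterations=300)` returns a probability vector whose entries have drifted from 0.5 toward 1. Lowering `learning_rate` slows this drift, which is the sense in which a smaller learning rate mimics a larger population.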
2.2 Compact Genetic Algorithm (cGA)

The compact genetic algorithm (cGA) [13] also maintains a probability vector instead of a
population. As in PBIL, the initial probability vector corresponds to the uniform distribution
over all n-bit binary strings and all its entries are thus set to 0.5. In each iteration, cGA generates
two candidate solutions from the current probability vector. Then, the two solutions are evaluated
and a tournament is executed between the two solutions. The winner $w = (w_1, \ldots, w_n)$ and the
loser $l = (l_1, \ldots, l_n)$ of this tournament are then used to update the probability vector.

Before presenting the update rule used in cGA, let us discuss the effects of a steady-state update
on the univariate probabilities of the probability vector in a population of size $N$ where the winner
replaces the loser. If for any position $i$ the winner contains a 1 in this position and the loser
contains a 0 in the same position, the probability $p_i$ of a 1 in this position would increase by
$1/N$. On the other hand, if the winner contains a 0 in this position and the loser contains a 1, then
the probability of a 1 in this position would decrease by $1/N$. Finally, if the winner and the loser
contain the same bit in any position, the probability of a 1 in this position would not change. This
update procedure can be simulated even without an explicit population using the following update
rule [13]:

\[
p_i \leftarrow
\begin{cases}
p_i - \frac{1}{N} & \text{if } w_i = 0 \text{ and } l_i = 1 \\
p_i + \frac{1}{N} & \text{if } w_i = 1 \text{ and } l_i = 0 \\
p_i & \text{otherwise}
\end{cases}
\]

Although cGA does not maintain an explicit population, the parameter $N$ serves as a replacement
for the population-size parameter (similarly to the learning rate in PBIL).

The performance of cGA can be expected to be similar to that of PBIL if both algorithms are set up
similarly. Furthermore, cGA should perform similarly to the simple genetic algorithm with uniform
crossover and population size N. Even more closely, cGA resembles the univariate marginal
distribution algorithm (UMDA) [21] and the equilibrium genetic algorithm (EGA) [17].
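
The cGA update can be illustrated with the following sketch. It stores an integer counter per position (so that the probability of a 1 is `counts[i] / N`) to avoid floating-point drift; this representation, the names `cga` and `cga_update`, the stopping rule, and the assumption that N is even are choices made for this example, not details taken from [13].

```python
import random

def cga_update(counts, winner, loser):
    """Move each counter one step (1/N in probability) toward the winner."""
    for i, (w, l) in enumerate(zip(winner, loser)):
        if w == 1 and l == 0:
            counts[i] += 1
        elif w == 0 and l == 1:
            counts[i] -= 1
        # Equal bits: the probability stays unchanged.

def cga(fitness, n, N=50, max_iterations=100000, seed=0):
    """Minimal cGA sketch; N plays the role of the population size (assumed even)."""
    rng = random.Random(seed)
    counts = [N // 2] * n  # p_i = counts[i] / N, initially 0.5
    for _ in range(max_iterations):
        a = [1 if rng.random() * N < counts[i] else 0 for i in range(n)]
        b = [1 if rng.random() * N < counts[i] else 0 for i in range(n)]
        winner, loser = (a, b) if fitness(a) >= fitness(b) else (b, a)
        cga_update(counts, winner, loser)
        # Stop once every probability has converged to 0 or 1.
        if all(c in (0, N) for c in counts):
            break
    return [c / N for c in counts]
```

A counter at 0 or N can no longer change, because both sampled strings then agree in that position; the probabilities therefore never leave the interval [0, 1].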



2.3 EDA with Optimal Dependency Trees 

In the EDA with optimal dependency trees [3], dependency-tree models are used and it is thus
necessary to maintain not only the univariate probabilities but also the pairwise probabilities for
all pairs of string positions. The pairwise probabilities are maintained using an array $A$, which
contains a number $A[X_i = a, X_j = b]$ for every pair of variables (string positions) $X_i$ and $X_j$
and every combination of assignments $a$ and $b$ of these variables. $A[X_i = a, X_j = b]$ represents an
estimate of the number of solutions with $X_i = a$ and $X_j = b$. Initially, all entries in $A$ are
initialized to some constant $C_{init}$; for example, $C_{init} = 1000$ may be used [3].

Given the array $A$, the marginal probabilities $p(X_i = a, X_j = b)$ are estimated for every pair of
variables $X_i$ and $X_j$ and every assignment of these variables as

\[
p(X_i = a, X_j = b) = \frac{A[X_i = a, X_j = b]}{\sum_{a', b'} A[X_i = a', X_j = b']}.
\]

Then, a dependency tree is built that maximizes the mutual information between connected pairs
of variables, where the mutual information between any two variables $X_i$ and $X_j$ is given by

\[
I(X_i, X_j) = \sum_{a, b} p(X_i = a, X_j = b) \log \frac{p(X_i = a, X_j = b)}{p(X_i = a)\, p(X_j = b)},
\]

where marginal probabilities $p(X_i = a)$ and $p(X_j = b)$ are computed from the pairwise probabilities
$p(X_i = a, X_j = b)$. The tree may be built using a variant of Prim's algorithm for finding minimum
spanning trees [30], adapted to maximize the total mutual information; this minimizes the
Kullback-Leibler divergence between the empirical distribution and the dependency-tree model [6].
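
The tree-building step can be sketched as follows. The nested-list layout `A[i][j][a][b]`, the function names, and the choice of position 0 as the root are assumptions made for this illustration; the sketch uses a plain quadratic Prim-style loop rather than any particular variant from [3] or [30].

```python
import math

def pair_probs(A, i, j):
    """Normalize the counts A[i][j][a][b] into p(X_i = a, X_j = b)."""
    total = sum(A[i][j][a][b] for a in (0, 1) for b in (0, 1))
    return [[A[i][j][a][b] / total for b in (0, 1)] for a in (0, 1)]

def mutual_information(p):
    """I(X_i, X_j) in bits from a 2x2 joint probability table p[a][b]."""
    pa = [p[a][0] + p[a][1] for a in (0, 1)]  # marginal of X_i
    pb = [p[0][b] + p[1][b] for b in (0, 1)]  # marginal of X_j
    return sum(p[a][b] * math.log2(p[a][b] / (pa[a] * pb[b]))
               for a in (0, 1) for b in (0, 1) if p[a][b] > 0)

def build_tree(A, n, root=0):
    """Prim-style maximum spanning tree; returns parent[i] (None for the root)."""
    parent = [None] * n
    in_tree = {root}
    while len(in_tree) < n:
        # Pick the tree/non-tree pair with the largest mutual information.
        _, i, j = max((mutual_information(pair_probs(A, i, j)), i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        parent[j] = i
        in_tree.add(j)
    return parent
```

With counts in which two positions always agree and a third is independent of both, the two agreeing positions end up connected, since their mutual information is 1 bit while every pair involving the third position scores 0.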

New candidate solutions can be generated from the probability distribution encoded by the
dependency tree, which is defined as

\[
p(X_1 = x_1, \ldots, X_n = x_n) = p(X_r = x_r) \prod_{i \neq r} p(X_i = x_i \mid X_{p(i)} = x_{p(i)}),
\]

where $r$ denotes the index of the root of the dependency tree and $p(i)$ denotes the index of the
parent of $X_i$. The generation starts at the root, the value of which is generated using its
univariate probabilities $p(X_r)$, and then continues down the tree by always generating the
variables whose parents have already been generated.

Each iteration of the dependency-tree EDA proceeds similarly to an iteration of PBIL. First, a
dependency tree is built from $A$. Then, $N$ candidate solutions are generated from the current
dependency tree and the $N_{best}$ best solutions are selected out of the generated candidates based
on their evaluation. The selected best solutions are then used to update entries in the array $A$;
for each solution $x = (x_1, \ldots, x_n)$, the update rule is executed as follows:

\[
A[X_i = a, X_j = b] \leftarrow
\begin{cases}
\alpha A[X_i = a, X_j = b] + 1 & \text{if } x_i = a \text{ and } x_j = b \\
\alpha A[X_i = a, X_j = b] & \text{otherwise}
\end{cases}
\]

where $\alpha \in (0, 1)$ is the decay rate; for example, $\alpha = 0.99$ [3]. The above update rule is
similar to that used in PBIL.

While both PBIL and cGA use a probabilistic model based on only univariate probabilities,
the EDA with optimal dependency trees is capable of encoding conditional dependencies between
some pairs of string positions, enabling the algorithm to efficiently solve some problems that are
intractable with PBIL and cGA. Nonetheless, dependency-tree models are still insufficient
to fully cover multivariate dependencies; this may render the EDA with optimal dependency trees
intractable on many decomposable problems with multivariate interactions [23].

3 Bayesian Optimization Algorithm (BOA)

This section describes the Bayesian optimization algorithm (BOA). First, the basic procedure of
BOA is described. Next, the methods for learning and sampling Bayesian networks in BOA are
briefly reviewed.

3.1 Basic BOA Algorithm

The Bayesian optimization algorithm (BOA) [26] [27] evolves a population of candidate solutions
represented by fixed-length vectors over a finite alphabet. In this paper we assume that candidate
solutions are represented by n-bit binary strings, but none of the presented techniques is limited
to only the binary alphabet. The first population of candidate solutions is typically generated at
random according to the uniform distribution over all possible strings.

Each iteration of BOA starts by selecting a population of promising candidate solutions from the
current population. Any selection method used in population-based evolutionary algorithms can
be used; for example, we can use binary tournament selection. Then, a Bayesian network is built
for the selected solutions. New solutions are generated by sampling the probability distribution
encoded by the learned Bayesian network. Finally, the new solutions are incorporated into the
original population; for example, this can be done by replacing the entire old population with the
new solutions. The procedure is terminated when some predefined termination criteria are reached;
for example, when a solution of sufficient quality has been reached or when the population has lost
diversity and it is unlikely that BOA will reach a better solution than the solution that has been
found already. The procedure of BOA is visualized in Figure 1.

Figure 1: Basic procedure of the Bayesian optimization algorithm (BOA).

3.2 Bayesian Networks 

A Bayesian network (BN) [23] is defined by two components:

Structure. The structure of a Bayesian network for n random variables is defined by a directed
acyclic graph where each node corresponds to one random variable and each edge defines a
direct conditional dependency between the connected variables. The nodes from
which there exists an edge to a given node are called the parents of this node.

Parameters. Parameters of a Bayesian network define conditional probabilities of all values of
each variable given any combination of values of the parents of this variable.

A Bayesian network defines the joint probability distribution

\[
p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \Pi_i),
\]

where $\Pi_i$ are the parents of $X_i$ and $p(X_i \mid \Pi_i)$ is the conditional probability of $X_i$
given $\Pi_i$. Each variable directly depends on its parents. On the other hand, the network encodes
many independence assumptions that may simplify the joint probability distribution significantly.
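
As a small worked illustration of this factorization, the sketch below evaluates the joint probability of a binary string from a network structure and its conditional probability tables. The data layout (a list `parents` of index tuples and a list `cpt` of dictionaries mapping parent values to the probability of a 1) is an assumption of this example, not a representation prescribed by BOA.

```python
def joint_probability(x, parents, cpt):
    """p(x) = prod_i p(x_i | values of the parents of X_i) for a binary string x."""
    prob = 1.0
    for i, xi in enumerate(x):
        pa_values = tuple(x[j] for j in parents[i])
        p_one = cpt[i][pa_values]  # p(X_i = 1 | parent values)
        prob *= p_one if xi == 1 else 1.0 - p_one
    return prob

# Two variables with the single edge X_0 -> X_1 (illustrative numbers).
example_parents = [(), (0,)]
example_cpt = [{(): 0.5}, {(0,): 0.1, (1,): 0.9}]
```

Here `joint_probability((1, 1), example_parents, example_cpt)` multiplies p(X_0 = 1) = 0.5 by p(X_1 = 1 | X_0 = 1) = 0.9, and the probabilities of all four strings sum to 1.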

Bayesian networks are more complex than the dependency trees discussed in Section 2.3, allowing
BOA to encode arbitrary multivariate dependencies. The estimation of Bayesian networks algorithm
(EBNA) [8] and the learning factorized distribution algorithm (LFDA) [20] are also EDAs based
on Bayesian network models.
3.3 Learning Bayesian Networks in BOA


To learn a Bayesian network from the set of selected solutions, we must learn both the structure 
of the network as well as the parameters (conditional and marginal probabilities). 

To learn the parameters, the maximum likelihood estimation defines the conditional probability
that $X_i = x_i$ given that the parents are set as $\Pi_i = \pi_i$, where $x_i$ and $\pi_i$ denote any
assignment of the variable and its parents:

\[
p(X_i = x_i \mid \Pi_i = \pi_i) = \frac{m(X_i = x_i, \Pi_i = \pi_i)}{m(\Pi_i = \pi_i)},
\]

where $m(X_i = x_i, \Pi_i = \pi_i)$ denotes the number of instances with $X_i = x_i$ and $\Pi_i = \pi_i$,
and $m(\Pi_i = \pi_i)$ denotes the number of instances with $\Pi_i = \pi_i$.

To learn the structure of a Bayesian network, a greedy algorithm [14] is typically used. In the
greedy algorithm for network construction, the network is initialized to an empty network with
no edges. Then, in each iteration, the edge that improves the quality of the network the most is
added until the network cannot be further improved or other user-specified termination criteria are
satisfied.

There are several approaches to evaluating the quality of a specific network structure. In this
work, we use the Bayesian information criterion (BIC) [33] to score network structures. BIC is
a two-part minimum description length metric [12], where one part represents model accuracy,
whereas the other part represents model complexity measured by the number of bits required to
store model parameters. For simplicity, let us assume that the solutions are binary strings of fixed
length n. BIC assigns the network structure $B$ a score [33]

\[
BIC(B) = \sum_{i=1}^{n} \left( -H(X_i \mid \Pi_i) N - 2^{|\Pi_i|} \frac{\log_2 N}{2} \right),
\]

where $H(X_i \mid \Pi_i)$ is the conditional entropy of $X_i$ given its parents $\Pi_i$, $n$ is the number
of variables, and $N$ is the population size (the size of the training data set). The conditional
entropy $H(X_i \mid \Pi_i)$ is given by

\[
H(X_i \mid \Pi_i) = - \sum_{x_i, \pi_i} p(X_i = x_i, \Pi_i = \pi_i) \log_2 p(X_i = x_i \mid \Pi_i = \pi_i),
\]

where the sum runs over all instances of $X_i$ and $\Pi_i$.
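
The BIC score above can be computed directly from a data set of binary strings, as in the following sketch. The function names and the representation of the structure as a list of parent-index tuples are assumptions of this example; the greedy edge-addition loop itself is not shown.

```python
import math
from collections import Counter

def conditional_entropy(data, i, parents):
    """H(X_i | Pi_i) in bits, estimated from the rows of data."""
    N = len(data)
    joint = Counter((row[i], tuple(row[j] for j in parents)) for row in data)
    marginal = Counter(tuple(row[j] for j in parents) for row in data)
    h = 0.0
    for (xi, pa), m in joint.items():
        # p(x_i, pi_i) = m / N and p(x_i | pi_i) = m / m(pi_i)
        h -= (m / N) * math.log2(m / marginal[pa])
    return h

def bic(data, structure):
    """BIC(B) = sum_i ( -H(X_i | Pi_i) N - 2^{|Pi_i|} log2(N) / 2 )."""
    N = len(data)
    return sum(-conditional_entropy(data, i, pa) * N
               - (2 ** len(pa)) * math.log2(N) / 2
               for i, pa in enumerate(structure))

# Illustrative data: two bits that always agree.
example_data = [(0, 0), (1, 1)] * 50
```

On `example_data`, the structure with the edge from position 0 to position 1, `[(), (0,)]`, scores higher than the empty structure `[(), ()]`: the entropy term for the second bit drops from 1 bit to 0, which outweighs the larger complexity penalty. A greedy learner would therefore add this edge.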



3.4 Sampling Bayesian Networks in BOA 

The sampling can be done using the probabilistic logic sampling of Bayesian networks [15], which 
proceeds in two steps. The first step computes an ancestral ordering of the nodes, where each node 
is preceded by its parents. 

In the second step, the values of all variables of a new candidate solution are generated according 
to the computed ordering. Since the algorithm generates the variables according to the ancestral 
ordering, when the algorithm attempts to generate the value of a variable, the parents of this 
variable must have already been generated. Given the values of the parents of a variable, the 
distribution of the values of the variable is given by the corresponding conditional probabilities. 
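
The two steps of the sampling procedure can be sketched as follows. The structure and table layout (`parents` as a list of index tuples, `cpt[i]` mapping parent values to the probability of a 1) and the function names are assumptions made for this illustration.

```python
import random

def ancestral_ordering(parents):
    """Order the nodes so that every node appears after all of its parents."""
    order, placed = [], set()
    remaining = set(range(len(parents)))
    while remaining:
        ready = [i for i in sorted(remaining)
                 if all(j in placed for j in parents[i])]
        if not ready:
            raise ValueError("structure contains a cycle")
        for i in ready:
            order.append(i)
            placed.add(i)
            remaining.discard(i)
    return order

def sample_solution(parents, cpt, rng):
    """Generate one binary string by sampling variables in ancestral order."""
    x = [None] * len(parents)
    for i in ancestral_ordering(parents):
        # The parents of X_i have already been generated at this point.
        pa_values = tuple(x[j] for j in parents[i])
        x[i] = 1 if rng.random() < cpt[i][pa_values] else 0
    return x
```

For the structure `[(1,), (), (0, 1)]`, in which position 1 has no parents, position 0 depends on position 1, and position 2 depends on both, the computed ordering is `[1, 0, 2]`.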
Figure 2: Basic procedure of the incremental Bayesian optimization algorithm (iBOA).

4 Incremental BOA (iBOA) 

This section outlines iBOA. First, the basic procedure of iBOA is briefly outlined. Next, the
procedures used to update the structure and parameters of the model are described, and we discuss
how to combine these components. Finally, the benefits and costs of using iBOA are analyzed
briefly.

4.1 iBOA: Basic Procedure 

The basic procedure of iBOA is similar to that of other incremental EDAs. The model is initialized
to the probability vector that encodes the uniform distribution over all binary strings; all entries 
in the probability vector are thus initialized to 0.5. 

In each iteration, several solutions are generated from the current model. Then, the generated 
solutions are evaluated. Given the results of the evaluation, the best and the worst solution out 
of the generated set of solutions are selected (the winner and the loser). The winner and the loser 
are used to update the parameters of the model. In some iterations, model structure is updated 
as well to reflect new dependencies that are supported by the results of the previous tournaments. 
The basic iBOA procedure is visualized in Figure 2.

There are two main differences between BOA and iBOA in the way the model is updated. First 
of all, the parameters must be updated incrementally because iBOA does not maintain an explicit 
population of solutions. Second, the model structure also has to be updated incrementally without 
using a population of strings to learn the structure from. The remainder of this section discusses 
the details of iBOA procedure. Specifically, we discuss the challenges that must be addressed in 
the design of the incremental version of BOA. Then, we present several approaches to dealing with 
these challenges and detail the most important iBOA components. 

4.2 Updating Parameters in iBOA 

The parameter updates are done similarly to those in cGA. Specifically, iBOA maintains an array of
marginal probabilities for each string position given other positions that the variable depends on
or that the variable may depend on. Let us denote the winner of the tournament (best solution)
by $w = (w_1, \ldots, w_n)$ and the loser (worst solution) by $l = (l_1, \ldots, l_n)$. A marginal
probability $p(X_{\beta(1)} = x_{\beta(1)}, \ldots, X_{\beta(k)} = x_{\beta(k)})$ of order $k$ with the
positions specified by $\beta(\cdot)$, denoted by $p(x_{\beta(1)}, \ldots, x_{\beta(k)})$ for brevity,
is updated as follows:

\[
p(x_{\beta(1)}, \ldots, x_{\beta(k)}) \leftarrow
\begin{cases}
p(x_{\beta(1)}, \ldots, x_{\beta(k)}) + \frac{1}{N} & \text{if } \forall j: w_{\beta(j)} = x_{\beta(j)} \text{ and } \exists j: l_{\beta(j)} \neq x_{\beta(j)} \\
p(x_{\beta(1)}, \ldots, x_{\beta(k)}) - \frac{1}{N} & \text{if } \exists j: w_{\beta(j)} \neq x_{\beta(j)} \text{ and } \forall j: l_{\beta(j)} = x_{\beta(j)} \\
p(x_{\beta(1)}, \ldots, x_{\beta(k)}) & \text{otherwise}
\end{cases}
\]

Figure 3: Updating parameters in iBOA proceeds by adding 1/N to the marginal probability
consistent with the winner and subtracting 1/N from the marginal probability consistent with the
loser. When the winner and the loser both point to the same entry in the probability table, the
table remains the same.



The above update rule increases each marginal probability by $1/N$ if the specific instance is
consistent with the winner of the tournament but inconsistent with the loser. On the other hand,
if the instance is consistent with the loser but not with the winner, the probability is decreased
by $1/N$. This corresponds to replacing the loser with the winner in a population of $N$ candidate
solutions. For any subset of variables, at most two marginal probabilities change in each update
because we only change marginal probabilities for the assignments consistent with either the winner
or the loser of the tournament. See Figure 3 for an example of the iBOA update rule for marginal
probabilities.

The conditional probabilities can be computed from the marginal ones. Thus, with the update 
rule for the marginal probabilities, iBOA can maintain any marginal and conditional probabilities 
necessary for sampling and structural updates. 
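
A marginal table and its update can be sketched as a dictionary keyed by assignments. The dictionary layout, the function names, and the convention that the child variable occupies the first position of each key are assumptions of this example. The sketch does not guard against entries leaving the interval [0, 1], a case the text above does not discuss.

```python
from itertools import product

def init_marginal(k):
    """Order-k marginal table under the uniform distribution."""
    return {a: 1.0 / 2 ** k for a in product((0, 1), repeat=k)}

def update_marginal(table, positions, winner, loser, N):
    """Add 1/N to the winner's entry and subtract 1/N from the loser's entry."""
    w = tuple(winner[j] for j in positions)
    l = tuple(loser[j] for j in positions)
    if w != l:  # identical projections leave the table unchanged
        table[w] += 1.0 / N
        table[l] -= 1.0 / N

def conditional_one(table, parent_values):
    """p(X = 1 | parents), where X is the first position of each key."""
    p1 = table[(1,) + tuple(parent_values)]
    p0 = table[(0,) + tuple(parent_values)]
    total = p0 + p1
    return p1 / total if total > 0 else 0.5  # fallback value is an assumption
```

Because the same amount is added to one entry and subtracted from another, the entries of each table keep summing to 1 after every update.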

While it is straightforward to initialize any marginal probability under the assumption of the
uniform distribution and to update the marginal probabilities using the results of a tournament,
one question remains open: which marginal probabilities do we actually need to maintain when we
do not know a priori what the model structure will look like? Since this question is closely
related to structural updates in iBOA, we discuss it next.



4.3 Updating Model Structure in iBOA 

In all incremental EDAs proposed in the past, it is clear from the beginning of the run which
probabilities have to be maintained. In cGA and PBIL, the only probabilities we have to maintain
are the univariate probabilities for different string positions. In the dependency-tree EDA, we also
have to maintain pairwise probabilities. But what probabilities do we need to maintain in iBOA?
This issue poses a difficult challenge because we do not know the model structure a priori, and it
is therefore not clear which conditional and marginal probabilities we will need.

Let us first focus on structural updates and assume that the current model is the probability
vector (Bayesian network with no edges). To decide on adding the first edge $X_i \to X_j$ based on the
BIC metric or any other standard scoring metric, we need to have pairwise marginal probabilities
$p(X_i, X_j)$. In general, let us consider a string position $X_i$ with the set of parents $\Pi_i$. To
decide on adding another parent $X_j$ to the current set of parents of $X_i$ using the BIC metric or
any other standard scoring metric, we also need probabilities $p(X_i, X_j, \Pi_i)$.

If we knew the current set of parents of each variable, to evaluate all possible edge additions, for
each variable, we would need at most $(n - 1)$ marginal probability tables. Overall, this would result
in at most $n(n - 1) = O(n^2)$ marginal probability tables to maintain. However, since we do not
know what the set of parents of any variable will be, even if we restricted iBOA to contain at most
$k$ parents for any variable, to consider all possible models and all possible marginal probabilities,
we would need to maintain at least $\binom{n}{k+1}$ marginal probability tables. Of course, maintaining
$\binom{n}{k+1}$ probability tables for relatively large $n$ will be intractable even for moderate
values of $k$. This raises an important question: can we do better and store a more limited set of
probability tables without sacrificing the model-building capabilities of iBOA?

To tackle this challenge, for each variable $X_i$, we are going to maintain several probability
tables. First of all, for any variable $X_i$, we will maintain the probability table for
$p(X_i, \Pi_i)$, which is necessary for specifying the conditional probabilities in the current
model. Additionally, for $X_i$, we will maintain probability tables $p(X_i, X_j, \Pi_i)$ for all
$X_j$ that can become parents of $X_i$, which are necessary for adding a new parent to $X_i$. This
will provide iBOA not only with the probabilities required to sample new solutions, but also with
those required to make a new edge addition ending in an arbitrary node of the network. Overall, the
number of subsets for which a probability table is maintained is upper bounded by $O(n^2)$, which is
a significant reduction from $\Omega(n^{k+1})$ for any $k \geq 2$.
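
To make the difference concrete, the following sketch compares the two table counts. The function names are hypothetical; the bound of n tables per variable follows from one table for the current parents plus at most n - 1 tables for candidate parents.

```python
from math import comb

def tables_all_subsets(n, k):
    """Tables needed if every subset of k + 1 variables had to be tracked."""
    return comb(n, k + 1)

def tables_iboa_bound(n):
    """Upper bound for the scheme above: at most n tables per variable."""
    return n * n
```

For n = 100 and k = 4, tracking all subsets would require 75,287,520 tables, whereas the scheme above needs at most 10,000. The size of each individual table still grows with the number of parents.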

Nonetheless, we still must resolve the problem of adding new marginal probabilities once we
make an edge addition. Specifically, if we add an edge $X_j \to X_i$, to add another edge ending in
$X_i$, we will need to store probabilities $p(X_i, X_j, X_k, \Pi_i)$ where $k$ denotes the index of any
other variable that can be added as a parent of $X_i$. While it is impossible to obtain exact values
of these probabilities unless we maintained them from the beginning of the run, one way to estimate
these parameters is to assume independence of $X_k$ and $(X_i, X_j, \Pi_i)$, resulting in the following
rule to initialize the new marginal probabilities:

\[
p(X_i, X_j, X_k, \Pi_i) = p(X_k)\, p(X_i, X_j, \Pi_i). \qquad (1)
\]

Once the new marginal probabilities are initialized, they can be updated after each new
tournament using the update rule presented earlier. Although the above independence assumption may
not hold in general, if the edge $X_k \to X_i$ is supported by future instances of iBOA, the edge will
eventually be added. While other approaches to dealing with the challenge of introducing new
marginal probabilities are possible, we believe that the strategy presented above should provide
robust performance, as is also supported by the experiments presented later. At the same time,
after adding an edge $X_j \to X_i$, we can eliminate probabilities $p(X_i, \Pi_i)$ from the set of
probability tables, because this probability table will not be necessary anymore.
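
The initialization in equation (1) amounts to taking the product of an existing table with the univariate distribution of the new variable. A minimal sketch, assuming tables stored as dictionaries keyed by assignment tuples and the new variable appended as the last position (both conventions are assumptions of this example):

```python
def extend_marginal(table, p_k_one):
    """Initialize p(..., X_k) = p(X_k) p(...) under the independence assumption.

    table maps assignments of (X_i, X_j, Pi_i) to probabilities;
    p_k_one is the current univariate probability p(X_k = 1).
    """
    extended = {}
    for assignment, prob in table.items():
        extended[assignment + (1,)] = prob * p_k_one
        extended[assignment + (0,)] = prob * (1.0 - p_k_one)
    return extended
```

Summing the extended table over the values of X_k recovers the original table, so the new table is consistent with the probabilities already stored at the moment it is created.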

Initially, when the model contains no edges, the marginal probabilities $p(X_i, X_j)$ for all pairs of
variables $X_i$ and $X_j$ must be stored for the first round of structural updates and updated after
each tournament. Later in the run, the marginal probabilities for each variable will be changed based
on the structure of the model and the results of the tournaments.

An example sequence of model updates with the corresponding sets of marginal probability
tables for a simple problem of n = 5 bits is shown in Figure 4. Each time a new marginal probability
table is added, its values are initialized according to equation (1).

Initial model: X1, X2, X3, X4, X5

Probabilities stored:
p(X1), p(X2), p(X3), p(X4), p(X5)                (for current structure)
p(X1,X2), p(X1,X3), p(X1,X4), p(X1,X5)           (for new parents of X1)
p(X2,X1), p(X2,X3), p(X2,X4), p(X2,X5)           (for new parents of X2)
p(X3,X1), p(X3,X2), p(X3,X4), p(X3,X5)           (for new parents of X3)
p(X4,X1), p(X4,X2), p(X4,X3), p(X4,X5)           (for new parents of X4)
p(X5,X1), p(X5,X2), p(X5,X3), p(X5,X4)           (for new parents of X5)

Added edge: X3 → X2

New model: X1, X2 ← X3, X3, X4, X5

Probabilities stored:
p(X1), p(X2,X3), p(X3), p(X4), p(X5)             (for current structure)
p(X1,X2), p(X1,X3), p(X1,X4), p(X1,X5)           (for new parents of X1)
p(X2,X3,X1), p(X2,X3,X4), p(X2,X3,X5)            (for new parents of X2)
p(X3,X1), p(X3,X4), p(X3,X5)                     (for new parents of X3)
p(X4,X1), p(X4,X2), p(X4,X3), p(X4,X5)           (for new parents of X4)
p(X5,X1), p(X5,X2), p(X5,X3), p(X5,X4)           (for new parents of X5)

Added edge: X2 → X1

New model: X1 ← X2, X2 ← X3, X3, X4, X5

Probabilities stored:
p(X1,X2), p(X2,X3), p(X3), p(X4), p(X5)          (for current structure)
p(X1,X2,X3), p(X1,X2,X4), p(X1,X2,X5)            (for new parents of X1)
p(X2,X3,X4), p(X2,X3,X5)                         (for new parents of X2)
p(X3,X4), p(X3,X5)                               (for new parents of X3)
p(X4,X1), p(X4,X2), p(X4,X3), p(X4,X5)           (for new parents of X4)
p(X5,X1), p(X5,X2), p(X5,X3), p(X5,X4)           (for new parents of X5)

Figure 4: iBOA stores all marginal probabilities for the current structure and those required for
adding a new edge into any node. Since some edges are disallowed (due to cycles), some probabilities
may be omitted. In the above example, for clarity, some marginal probabilities are repeated (this
would not be done in the actual implementation).



4.4 Sampling New Solutions 

There is no difference between the form of the Bayesian network learned in BOA and that learned
in iBOA. Therefore, the same sampling algorithm as in BOA can be used in iBOA. Specifically, the
variables are first topologically ordered and, for each string, the variables are generated
according to this ancestral ordering using the conditional probabilities stored in the model, as
described in Section 3.4.



4.5 Strategies for Combining Components of iBOA 

There are several strategies for combining all the iBOA components described above together. This 
section briefly reviews and discusses several of these strategies. 

The first approach is to perform continuous updates of both the structure and the
parameters. After performing each tournament, all probabilities are updated first, and then the
structure is updated by adding any new edges that lead to an improvement of model quality.

Incremental BOA (iBOA)
  t := 0;
  B := probability vector (no edges);
  p := marginal probabilities for B assuming uniform distribution;
  while (not done) {
    generate k solutions from B with probabilities p;
    evaluate the generated solutions;
    update p using the new solutions;
    update B using the new p;
    t := t+1;
  };

Figure 5: Pseudocode of the incremental Bayesian optimization algorithm (iBOA). Model structure
is denoted by B, marginal probabilities are denoted by p. Depending on the variant of iBOA, some
structural updates may be skipped.

The second approach attempts to simulate BOA more closely by updating the structure
only once every N iterations; only the probabilities for sampling new solutions are updated in
every iteration of iBOA. This significantly reduces the cost of structural updates and
improves the overall efficiency. While the model structure is not updated as frequently as in
the first approach, the structural updates might be more accurate because the metric is forced to
use more data to make an adequate structural update.

The third approach removes the steady-state component of iBOA and updates both the
probabilities for sampling new solutions and the model structure only once in every N iterations.
That means that until the next structural update, the probability distribution encoded by the
current model remains constant; it changes only after N parameter updates have been accumulated,
at which point the sampling probabilities are set to the new values and new edges are added.

All of the above approaches can be implemented efficiently, although in practice it appears that
the second approach performs the best. The basic procedure of iBOA is outlined in Figure 5.
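
The difference between the first two strategies is only in how often the structural update runs, which the following sketch makes explicit. The function `run_schedule` and its callback parameters are hypothetical names introduced for this illustration; the callbacks stand in for the parameter and structure updates described above, and the third strategy is not covered.

```python
def run_schedule(iterations, structure_period, update_parameters, update_structure):
    """Update parameters every iteration and structure every structure_period iterations.

    structure_period = 1 corresponds to the first approach (continuous updates);
    structure_period = N corresponds to the second approach.
    """
    for t in range(1, iterations + 1):
        update_parameters(t)
        if t % structure_period == 0:
            update_structure(t)
```

With `iterations=6` and `structure_period=3`, the parameter callback is invoked six times and the structure callback twice, after iterations 3 and 6.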

4.6 Benefits and Costs 

Clearly, the main benefit of using iBOA instead of the standard BOA is that iBOA eliminates 
the population and it will thus reduce the memory requirements of BOA. This can be especially 
important when solving extremely big and difficult problems where the populations may become 
very large. iBOA also provides the first incremental EDA capable of maintaining multivariate 
probabilistic models built with the use of multivariate statistics. 

Nonetheless, eliminating the population also brings disadvantages. First of all, it becomes
difficult to effectively maintain diversity using niching, such as restricted tournament selection,
because niching techniques typically require an explicit population of candidate solutions. While
it might be possible to design specialized niching techniques that directly promote diversity by
modifying the probabilistic model in some way, doing this seems to be far from straightforward.
Second, while iBOA reduces the memory complexity of BOA by eliminating the population, it is still
necessary to store the probabilistic model, including all marginal probabilities required to make
new edge additions. Since the marginal probability tables may require even more memory than the
population itself, the memory savings will not be as significant as in cGA or PBIL. Nonetheless, as
discussed in the section on future work, this problem may be alleviated by using local structures
in Bayesian networks, such as default tables or decision trees.



5 Experiments 

This section presents experimental results obtained with iBOA on concatenated traps of order 4 
and 5, and compares performance of iBOA to that of the standard BOA. 



5.1 Test Problems 

To test iBOA, we used two separable problems with fully deceptive subproblems based on the 
well-known trap function: 

• Trap-4. In trap-4 [HE], the input string is first partitioned into independent groups of 4 bits 
each. This partitioning is unknown to the algorithm and it does not change during the run. 
A 4-bit fully deceptive trap function is applied to each group of 4 bits and the contributions 
of all trap functions are added together to form the fitness. The contribution of each group 
of 4 bits is computed as 

\[ \mathrm{trap}_4(u) = \begin{cases} 4 & \text{if } u = 4 \\ 3 - u & \text{otherwise,} \end{cases} \]

where u is the number of 1s in the input string of 4 bits. The task is to maximize the function. 
An n-bit trap-4 function has one global optimum in the string of all 1s and $2^{n/4} - 1$ other 
local optima. Traps of order 4 necessitate that all bits in each group are treated together, 
because statistics of lower order are misleading. Since hBOA performance is invariant with 
respect to the ordering of string positions [24], it does not matter how the partitioning into 
4-bit groups is done, and thus, to make some of the results easier to understand, we assume 
that trap partitions are located in contiguous blocks of bits. 

• Trap-5. In trap-5 [HITj, the input string is also partitioned into independent groups but in 
this case each partition contains 5 bits and the contribution of each partition is computed 
using the trap of order 5: 

\[ \mathrm{trap}_5(u) = \begin{cases} 5 & \text{if } u = 5 \\ 4 - u & \text{otherwise,} \end{cases} \]

where u is the number of 1s in the input string of 5 bits. The task is to maximize the function. 
An n-bit trap-5 function has one global optimum in the string of all 1s and $2^{n/5} - 1$ other 
local optima. Traps of order 5 necessitate that all bits in each group are treated together, 
because statistics of lower order are misleading. 
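
The two test functions can be written down compactly. The following is a minimal sketch, assuming (as stated above for readability) that the trap partitions occupy contiguous blocks of bits; the function names are illustrative and not taken from any existing implementation.

```python
def trap(u: int, k: int) -> int:
    """Fully deceptive trap of order k; u is the number of 1s in a k-bit group."""
    return k if u == k else k - 1 - u


def concatenated_trap(bits, k: int) -> int:
    """Sum of order-k traps over contiguous, non-overlapping k-bit groups."""
    if len(bits) % k != 0:
        raise ValueError("string length must be a multiple of k")
    return sum(trap(sum(bits[i:i + k]), k) for i in range(0, len(bits), k))


if __name__ == "__main__":
    print(concatenated_trap([1] * 8, 4))  # global optimum of an 8-bit trap-4
    print(concatenated_trap([0] * 8, 4))  # deceptive attractor: all 0s
```

Within each group, the fitness increases as 1s are removed until the group contains only 0s, so low-order statistics lead away from the global optimum unless all bits of a group are treated together.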



5.2 Description of Experiments 

Although iBOA does not maintain an explicit population of candidate solutions, it still uses a 
parameter that loosely corresponds to the actual population size in the standard, population- 
based BOA. Thus, while iBOA is population-less, we still need to set an adequate population size 
to ensure that iBOA finds the global optimum reliably. We used the bisection method |3H 
to estimate the minimum population size to reliably find the global optimum in 10 out of 10 
independent runs. To get more stable results, 10 independent bisection runs were repeated for each 
problem size and thus the results for each problem size were averaged over 100 successful runs. 
The number of generations was upper bounded by the number of bits n, based on the convergence 
theory for BOA [22] [35] \W\ [24] and preliminary experiments. For iBOA, the number of generations 
is defined as the number of iterations divided by the population size. 
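
The bisection procedure for the population size can be sketched as follows. This is a common formulation (double the size until the algorithm succeeds, then bisect the bracket until it is narrow) and not necessarily the exact variant used here. The callback `succeeds`, the starting size, and the 10% tolerance are assumptions made for illustration; `succeeds(N)` is expected to return True when all required runs (for example, 10 out of 10) find the global optimum with population size N.

```python
from typing import Callable


def bisect_population_size(succeeds: Callable[[int], bool],
                           start: int = 10,
                           tolerance: float = 0.1,
                           max_size: int = 1 << 30) -> int:
    """Estimate the smallest population size N for which succeeds(N) is True.

    Assumes success is (roughly) monotonic in N: if N suffices, larger N does too.
    """
    low, high = 0, start
    # Phase 1: double the size until the algorithm succeeds.
    while not succeeds(high):
        low, high = high, high * 2
        if high > max_size:
            raise RuntimeError("no sufficient population size found")
    # Phase 2: shrink the bracket (low fails, high succeeds) until it is narrow.
    while high - low > 1 and (high - low) > tolerance * high:
        mid = (low + high) // 2
        if succeeds(mid):
            high = mid
        else:
            low = mid
    return high


if __name__ == "__main__":
    # Toy stand-in for "10 out of 10 successful runs": succeeds iff N >= 137.
    print(bisect_population_size(lambda n: n >= 137))
```

Because real runs are stochastic, the success predicate is noisy in practice, which is one reason to repeat the whole bisection several times and average the results.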

In iBOA, the number of solutions in each tournament was set to k = 4 based on preliminary 
experiments, which showed that this value of k performed well. To use selection of similar strength 
in BOA, we used the tournament selection with tournament size 4. Although these two methods 
are not equivalent, they should perform similarly. In both BOA and iBOA, the BIC metric was used 
to evaluate competing network structures in model building and the maximum number of parents 
was not restricted in any way. In iBOA, the model structure is updated once in every iterations, 
while the sampling probabilities are updated in each iteration. Finally, in BOA, the new population 
of candidate solutions replaces the entire original population; while this setting is not optimal 
(typically elitist replacement or restricted tournament replacement would perform better [24] [25]), 
it was still the method of choice to make the comparison fair because iBOA does not use any elitism 
or niching either. 

Although one of the primary goals of setting up BOA and iBOA was to make these algorithms 
perform similarly, the comparison of these two algorithms is just a side product of our experiments. 
The most important goal was to provide empirical support for the ability of iBOA to discover 
and maintain a multivariate probabilistic model without using an explicit population of candidate 
solutions. We also tried the original cGA; however, due to the use of the simple model in the form 
of the probability vector, cGA was not able to solve problems of size n > 20 even with extremely 
large populations and these results were thus omitted. 

5.3 Results 

Figure [6] shows the number of evaluations required by iBOA to reach the global optimum on 
concatenated traps of order 4 and 5. In both cases, the number of evaluations grows as a low-order 
polynomial; for trap-4, the growth can be approximated as 0{n^'^^), whereas for trap-5, the growth 
can be approximated as 0{n^'^^). While the fact that the number of evaluations required by iBOA 
scales worse on trap-5 than on trap-4 seems somewhat surprising, both cases are relatively close to 
the bound predicted by BOA scalability theory [29] [24], which estimates the growth as 0(n^'^^). 
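
One common way to obtain such a polynomial approximation is a least-squares fit of a straight line in log-log space. The sketch below illustrates that technique only; it is an assumption that a fit of this kind was used, and the data in the usage example are synthetic, not the measurements reported here.

```python
import math


def fit_power_law(sizes, evaluations):
    """Least-squares fit of evaluations ~ a * n**b in log-log space; returns (a, b)."""
    if len(sizes) != len(evaluations) or len(sizes) < 2:
        raise ValueError("need at least two (size, evaluations) pairs")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in evaluations]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope = estimated order of growth
    a = math.exp(mean_y - b * mean_x)   # intercept = multiplicative constant
    return a, b


if __name__ == "__main__":
    # Synthetic data generated from 3 * n**1.5, used only to exercise the fit.
    sizes = [20, 40, 80, 160]
    evals = [3 * n ** 1.5 for n in sizes]
    print(fit_power_law(sizes, evals))
```

The fitted exponent b plays the role of the exponent in the approximations quoted above; with few problem sizes, the estimate is sensitive to the smallest instances.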

The low-order polynomial performance of iBOA on trap-4 and trap-5 provides strong empirical 
evidence that iBOA is capable of finding an adequate problem decomposition because models that 
would fail to capture the most important dependencies on the fully deceptive problems trap-4 and 
trap-5 would fail to solve these problems scalably [H [3^ [TO]. 

Figure [7] shows the number of evaluations required by standard BOA to reach the global 
optimum on concatenated traps of order 4 and 5. In both cases, the number of evaluations grows 
as a low-order polynomial; for trap-4, the growth can be approximated as 0(n^'^^), whereas for 
trap-5, the growth can be approximated as 0{n^'^^). In both cases, we see that BOA performs 
worse than predicted by scalability theory [29] [24], which is most likely because no elitist 
replacement strategy was used (elitist replacement significantly alleviates the necessity of having 
accurate models in the first few iterations [23]), and because of the potential for too strong 
pressure towards overly simple models due to the use of the BIC metric to score network 
structures. In any case, we can conclude 
that iBOA not only keeps up with standard BOA, but without an elitist replacement strategy or 
niching, it even outperforms BOA with respect to the order of growth of the number of function 
evaluations with problem size. 
Figure 6: Performance of iBOA on concatenated traps of order 4 and 5. 



6 Future Work 

While the experiments confirmed that iBOA is capable of learning multivariate models incremen- 
tally without using a population of candidate solutions, the advantages of doing this are counter- 
balanced by the disadvantages. Most importantly, there are two issues that need to be addressed 
in future work: (1) Complexity of model representation should be improved using local structures 
in Bayesian networks, such as default tables [9] or decision trees/graphs [5] [9]. (2) Elitist and 
diversity-preservation techniques should be incorporated into iBOA to improve its performance. 
Without addressing these difficulties, the advantages of using iBOA instead of BOA are somewhat 
overshadowed by the disadvantages. 

7 Summary and Conclusions 

This paper proposed an incremental version of the Bayesian optimization algorithm (BOA). The 
proposed algorithm was called the incremental BOA (iBOA). Just like BOA, iBOA uses Bayesian 
networks to model promising solutions and sample the new ones. However, iBOA does not maintain 
an explicit population of candidate solutions; instead, iBOA performs a series of small tournaments 
between solutions generated from the current Bayesian network, and updates the model incre- 
mentally using the results of the tournaments. Both the structure and parameters are updated 
incrementally. 

The main advantage of using iBOA rather than BOA is that iBOA does not need to main- 
tain a population of candidate solutions and its memory complexity is thus reduced compared to 
BOA. However, without the population, implementing elitist and diversity-preservation techniques 
becomes a challenge. Furthermore, memory required to store the Bayesian network remains sig- 
nificant and should be addressed by using local structures in Bayesian networks to represent the 
models more efficiently. Despite the above difficulties, this work represents the first step toward the 
design of competent incremental EDAs, which can build and maintain multivariate probabilistic 
models without using an explicit population of candidate solutions, reducing memory requirements 
of standard multivariate estimation of distribution algorithms. 
Figure 7: Performance of standard BOA on concatenated traps of order 4 and 5. 